
Conversation

penelopeysm (Member) commented Nov 12, 2025

Warning

This PR is not really meant for merging, it's mainly for discussion.

This PR replaces ThreadSafeVarInfo (which is essentially a vector of accumulators) with atomic accumulators, that is, mutable struct accumulators whose fields can be added to atomically.

(This idea was previously discussed in #1078, and quite a few of the ideas there will reappear here.)
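
To make this concrete, here is a minimal sketch of what such an accumulator could look like. The names and helper below are illustrative only, not the actual types in this PR:

# A minimal sketch, not the PR's implementation: a log-probability accumulator whose
# running total is updated atomically from multiple threads, instead of keeping one
# copy of the accumulator per thread as ThreadSafeVarInfo effectively does.
mutable struct AtomicLogProbAcc{T<:Real}
    @atomic logp::T
end

# Each thread adds its contribution with an atomic read-modify-write.
add_logp!(acc::AtomicLogProbAcc, inc::Real) = (@atomic acc.logp += inc; acc)

acc = AtomicLogProbAcc(0.0)
Threads.@threads for i in 1:1000
    add_logp!(acc, -0.5)
end
@atomic acc.logp   # -500.0, regardless of how the iterations interleave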

Why?

Central to #1132 is the idea of removing Metadata. With metadata removed, there is, at least for (Fast)LDF purposes, no longer anything blocking threaded assume tilde-statements.*

Here is a proof of principle. I want to be super clear in saying that the code changes in this PR are not needed to enable threaded assume. ThreadSafeVarInfo already works correctly, and the snippet below can be run on either #1132 or this PR. Benchmarks are provided below for both.

* There is one blocker: to generate a FastLDF, you need a metadata from which to read the ranges and linked statuses, but you can't generate that metadata if the model has threaded assume statements. The snippet here works around this by manually constructing the ranges. Solving this properly would require some surgery on get_ranges_and_linked, but it should be doable (and indeed I believe that we should ultimately decouple get_ranges_and_linked from a metadata).

Performance

(This code is quite poorly written, by the way; it would be much better if expected_logpdf and get_ranges_and_linked dispatched on the model function. And yes, the trivial model has a bogus y=nothing argument just to make those functions work.)

TSVI code is run with #1132, atomic accs code with this PR. Julia 1.11.7. Number of threads as shown below.

using DynamicPPL, Distributions, LogDensityProblems, ForwardDiff, ADTypes, Test, Chairmarks

@model trivial(y=nothing) = x ~ Normal()
@model function threaded_asm_and_obs(y)
    x = Vector{Float64}(undef, length(y))
    Threads.@threads for i in eachindex(y)
        x[i] ~ Normal()
        y[i] ~ Normal(x[i])
    end
end
@model function threaded_obs_only(y)
    x ~ Normal()
    Threads.@threads for i in eachindex(y)
        y[i] ~ Normal(x)
    end
end

function expected_logpdf(x, y)
    p = sum(logpdf.(Normal(), x))
    y === nothing && return p # this is the trivial model
    # handle both models
    lk = if length(x) == length(y)
        sum(logpdf.(Normal.(x), y))
    else
        sum(logpdf.(Normal(x[1]), y))
    end
    return p + lk
end
function expected_grad(x, y)
    return ForwardDiff.gradient(Base.Fix2(expected_logpdf, y), x)
end

### Need to manually construct the ranges, because the default
### behaviour is to generate a VarInfo and extract ranges from that
### -- but generating the VarInfo would fail if the model has
### threaded assume.
struct NotAVarInfo <: DynamicPPL.AbstractVarInfo
    xs::Vector{Float64}
end
function DynamicPPL.Experimental.get_ranges_and_linked(v::NotAVarInfo)
    n_xs = length(v.xs)
    return if n_xs == 50
        all_iden_ranges = NamedTuple()
        all_dict_ranges = Dict{VarName,DynamicPPL.RangeAndLinked}()
        for i in 1:n_xs
            all_dict_ranges[@varname(x[i])] = DynamicPPL.RangeAndLinked(i:i, false)
        end
        all_iden_ranges, all_dict_ranges
    else
        (x=DynamicPPL.RangeAndLinked(1:1, false),), Dict{VarName,DynamicPPL.RangeAndLinked}()
    end
end
Base.getindex(v::NotAVarInfo, ::Colon) = v.xs

s = Threads.nthreads() > 1 ? "s" : ""
println("$(Threads.nthreads()) thread$s")
function test_atomic(model_constructor, xs, ys; adtype=AutoForwardDiff())
    model = model_constructor(ys)
    println("    $(model.f)")
    # Evaluation: correctness
    print("        eval ")
    ldf = DynamicPPL.Experimental.FastLDF(model, getlogjoint_internal, NotAVarInfo(xs))
    lp1 = LogDensityProblems.logdensity(ldf, xs)
    @test lp1 ≈ expected_logpdf(xs, ys)
    # Evaluation: perf
    display(median(@be LogDensityProblems.logdensity(ldf, xs)))
    # Gradient: correctness
    ldf = DynamicPPL.Experimental.FastLDF(model, getlogjoint_internal, NotAVarInfo(xs); adtype=adtype)
    lp2, grad = LogDensityProblems.logdensity_and_gradient(ldf, xs)
    @test lp2 ≈ expected_logpdf(xs, ys)
    @test grad ≈ expected_grad(xs, ys)
    # Gradient: perf
    print("        grad ")
    display(median(@be LogDensityProblems.logdensity_and_gradient(ldf, xs)))
end

test_atomic(trivial, ones(1), nothing)
test_atomic(threaded_asm_and_obs, ones(50), zeros(50))
test_atomic(threaded_obs_only, ones(1), zeros(50))

# Both PRs (no threadsafe stuff)
#
# 1 thread
#     trivial
#         eval 22.459 ns (1 allocs: 16 bytes)
#         grad 45.738 ns (4 allocs: 128 bytes)
#     threaded_asm_and_obs
#         eval 30.125 μs (770 allocs: 20.156 KiB)
#         grad 207.167 μs (4369 allocs: 449.125 KiB)
#     threaded_obs_only
#         eval 23.792 μs (422 allocs: 12.125 KiB)
#         grad 26.708 μs (428 allocs: 14.766 KiB)
#
# This PR (atomic accumulators)
#
# 4 threads
#     trivial
#         eval 148.840 ns (10 allocs: 272 bytes)
#         grad 116.071 ns (7 allocs: 272 bytes)
#     threaded_asm_and_obs
#         eval 30.125 μs (989 allocs: 28.938 KiB)
#         grad 159.916 μs (3985 allocs: 264.266 KiB)
#     threaded_obs_only
#         eval 17.458 μs (445 allocs: 14.062 KiB)
#         grad 24.375 μs (445 allocs: 14.953 KiB)
#
# PR #1132 (ThreadSafeVarInfo)
#
# 4 threads
#     trivial
#         eval 573.864 ns (19 allocs: 976 bytes)
#         grad 210.262 ns (12 allocs: 1.156 KiB)
#     threaded_asm_and_obs
#         eval 28.958 μs (1194 allocs: 35.047 KiB)
#         grad 117.833 μs (4015 allocs: 533.875 KiB)
#     threaded_obs_only
#         eval 13.354 μs (454 allocs: 15.578 KiB)
#         grad 13.802 μs (450 allocs: 19.156 KiB)

TSVI or atomic accs?

Based on the above benchmarks, it looks like TSVI is faster when something threadsafe needs to be done (especially for gradients). But when there's no threadsafe evaluation needed (trivial model), TSVI introduces way more overhead.

It's not shown above, but I also tried making the trivial model much larger, with fifty Normal() variables (in serial). TSVI again introduced far more overhead for evaluation (3.5x more than atomic accs), but oddly enough the TSVI gradient was slightly faster.

Here's a table with some pros and cons:

| | TSVI | Atomic accs | Explanation |
| --- | :---: | :---: | --- |
| Supports threaded assume with FastLDF | ✅ | ✅ | True of both, because no metadata is needed. |
| Generalises to all accumulators | ✅ | ❌ | This PR only 'fixes' the logp accumulators; threadsafe versions of other accumulators must be patched separately. For LDF this is fine, but ultimately we would have to fix VAIM and others. Not impossible, but to its credit, TSVI fixes it all in one fell swoop. |
| Avoids threadid indexing | ❌ | ✅ | threadid indexing is the core problem with Threads.@threads, and we're quite lucky not to have run into correctness issues with TSVI so far. That might come back to bite us one day. |
| Performance | — | — | A bit of a toss-up, see above. But I think that if we provide a way to opt in or out, TSVI wins, because it's faster when actually needed; its large overheads on models that don't need it wouldn't matter if you can opt out. |
| Fixes the convert_eltype issue | — | ❌ | Because the atomic accumulator is a mutable struct, its type can't be promoted automatically midway through model evaluation, so it has to be instantiated with the correct type at the very beginning. You still need the whole float_type_with_fallback(eltype(params)) thing (see the sketch after this table). |
| Code complexity | ❌ | ✅ | (My opinion!) TSVI is basically a big patch on top of the whole thing, whereas accumulators are the 'correct' place to fix this. |
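
To illustrate the convert_eltype point, here's a tiny sketch (the Acc type is hypothetical, not DynamicPPL's actual accumulator): once the mutable struct is constructed with a Float64 field, you can't store e.g. a ForwardDiff.Dual in it, so the eltype has to be decided before evaluation starts.

using ForwardDiff

mutable struct Acc{T<:Real}
    @atomic logp::T
end

acc = Acc(0.0)                # field type is fixed to Float64 at construction
dual = ForwardDiff.Dual(1.0, 1.0)
# @atomic acc.logp += dual    # errors: the resulting Dual can't be converted to Float64
acc = Acc(zero(dual))         # instead, the eltype must be chosen up front, which is
                              # what float_type_with_fallback(eltype(params)) is for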

penelopeysm marked this pull request as draft November 12, 2025 01:37
github-actions (Contributor) commented

Benchmark Report for Commit aceec5d

Computer Information

Julia Version 1.11.7
Commit f2b3dbda30a (2025-09-08 12:10 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

Benchmark Results

┌───────────────────────┬───────┬─────────────┬───────────────────┬────────┬────────────────┬─────────────────┐
│                 Model │   Dim │  AD Backend │           VarInfo │ Linked │ t(eval)/t(ref) │ t(grad)/t(eval) │
├───────────────────────┼───────┼─────────────┼───────────────────┼────────┼────────────────┼─────────────────┤
│ Simple assume observe │     1 │ forwarddiff │             typed │  false │            6.6 │             1.8 │
│           Smorgasbord │   201 │ forwarddiff │             typed │  false │          739.0 │            43.8 │
│           Smorgasbord │   201 │ forwarddiff │ simple_namedtuple │   true │          438.8 │            56.9 │
│           Smorgasbord │   201 │ forwarddiff │           untyped │   true │          803.5 │            36.0 │
│           Smorgasbord │   201 │ forwarddiff │       simple_dict │   true │         6925.2 │            31.7 │
│           Smorgasbord │   201 │ forwarddiff │      typed_vector │   true │          769.2 │            41.8 │
│           Smorgasbord │   201 │ forwarddiff │    untyped_vector │   true │          792.4 │            37.4 │
│           Smorgasbord │   201 │ reversediff │             typed │   true │          925.7 │            45.9 │
│           Smorgasbord │   201 │    mooncake │             typed │   true │          734.0 │             6.1 │
│           Smorgasbord │   201 │      enzyme │             typed │   true │          916.5 │             4.1 │
│    Loop univariate 1k │  1000 │    mooncake │             typed │   true │         3967.5 │             5.7 │
│       Multivariate 1k │  1000 │    mooncake │             typed │   true │         1034.9 │             8.9 │
│   Loop univariate 10k │ 10000 │    mooncake │             typed │   true │        43600.9 │             5.3 │
│      Multivariate 10k │ 10000 │    mooncake │             typed │   true │         8902.5 │             9.8 │
│               Dynamic │    10 │    mooncake │             typed │   true │          125.4 │            11.1 │
│              Submodel │     1 │    mooncake │             typed │   true │            9.5 │             6.2 │
│                   LDA │    12 │ reversediff │             typed │   true │         1046.9 │             1.9 │
└───────────────────────┴───────┴─────────────┴───────────────────┴────────┴────────────────┴─────────────────┘

mhauru (Member) commented Nov 12, 2025

> But I think that if we provide a way to opt out or in, then TSVI wins, because it's faster when actually needed. Although TSVI has large overheads for models which don't need it, if you can opt out, that wouldn't be an issue.

That's also assuming that people know of and use the opt out. I bet a lot of people wouldn't realise it's there.

I wonder if the overheads for atomic accs might be mostly because of mutable rather than @atomic. Even if so, I'm not sure that helps us avoid them in any way. If I find the time I might do some more benchmarking to see how things scale with model size, but the overheads don't look promising.
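
One way to separate the two costs would be something like this (a sketch with made-up accumulator types, benchmarked with Chairmarks as in the snippets above):

using Chairmarks

mutable struct PlainAcc
    logp::Float64
end
mutable struct AtomicAcc
    @atomic logp::Float64
end

function accumulate_plain!(a::PlainAcc, n)
    for _ in 1:n
        a.logp += 1.0
    end
    return a.logp
end
function accumulate_atomic!(a::AtomicAcc, n)
    for _ in 1:n
        @atomic a.logp += 1.0
    end
    return @atomic a.logp
end

@be accumulate_plain!(PlainAcc(0.0), 1_000)   # overhead of mutable alone
@be accumulate_atomic!(AtomicAcc(0.0), 1_000) # overhead of mutable plus @atomic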

Base automatically changed from py/ldf to breaking November 13, 2025 13:30
